LSC Storage Server Market Overview

Introduction

Much attention is focused on "open systems" in today's computer market, and with good reason. The world at large has agreed that interoperability should be the paramount goal of computer vendors - and vendors have responded with open systems solutions.

The evolution of the industry towards open systems has occurred in stages. The current stage has compute power distributed to all points of the network infrastructure. The issue that LSC was formed to deal with is the next "wave" of this evolution, where the data requirements of the network exceed its ability to handle them. Widely distributed networks make this job even more difficult.

Data is the most important asset of any enterprise. Until adequate measures are adopted to manage it effectively, that asset is at risk.

Growing Storage Requirements

As distributed computer networks carry more mission-critical applications, data availability and storage become vital issues. Hardware vendors have developed storage devices which can hold vast amounts of data. Fault tolerance and redundancy serve to reduce data loss from equipment failure.

However, beyond the continuing enhancements in available hardware there are other equally vital issues, including file accessibility, long-term storage (archiving), control of sensitive corporate data, version control and recovery from disasters.

The Extent of the Data Explosion

How much data are we talking about? According to industry analysts, the average Fortune 1000 company manages over one terabyte of data today (a terabyte is 1,000,000 megabytes). By the year 2000, the average Fortune 1000 company will manage over one petabyte (one billion megabytes). As a point of reference, storing one terabyte on 9-track tape requires 6,666 reels, at a cost of over $100,000.
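
As a rough check of the reel count above, the following minimal sketch assumes roughly 150 MB per 6250-bpi 9-track reel and about $15 per reel (neither figure appears in the text) and reproduces the order of magnitude quoted:

    /* Back-of-the-envelope check of the 9-track tape figures above.
     * Assumptions (not from the text): ~150 MB per 6250-bpi reel and
     * roughly $15 per reel for media and handling. */
    #include <stdio.h>

    int main(void)
    {
        const double terabyte_mb   = 1000000.0;  /* 1 TB expressed in MB, as in the text */
        const double mb_per_reel   = 150.0;      /* assumed capacity of one 9-track reel */
        const double cost_per_reel = 15.0;       /* assumed cost per reel, USD */

        double reels = terabyte_mb / mb_per_reel;
        printf("Reels needed for 1 TB:  %.0f\n", reels);                  /* ~6,666 reels */
        printf("Approximate media cost: $%.0f\n", reels * cost_per_reel); /* ~$100,000 */
        return 0;
    }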

Strategic Research Corporation (formerly Peripheral Strategies), a Santa Barbara, California-based market research firm, projects that active data storage on the average network is growing at 60% per year, and is expected to top 41GB by 1997 (EDGE: Work-group Computing Report, April 1994, v5n204 p6(1)). The increase in storage needs can be attributed to several factors. Applications such as imaging and CAD/CAM, as well as graphical user interfaces, produce files that can average several megabytes each. In the corporate environment, data is migrating from the mainframe to the personal computer-based network. As LAN fault tolerance increases, the data migration is likely to accelerate. In addition, the cost for on-line storage continues to fall, leading more companies to keep more data on-line.

What's missing in most networks are the tools to effectively manage all that data.

Managing Distributed Data

It is expected that in the near future, companies will keep all their data on-line (or near-line, meaning available via a robotically-controlled jukebox). This is due to the falling cost of on-line storage and the fact that automatic access to archives provides significant labor cost savings. The "intangible" benefit of having archival data on-line is that users are much more likely to refer to past work, enabling the company to benefit from past experience and avoid past mistakes. In addition to massive amounts of data storage, some mechanism is needed for keeping track of all the files on the system. For instance, if a user wishes to access a file which has not been accessed for 10 years, it may have been migrated to some form of storage the user knows nothing about. A critical requirement is a mechanism that points to the file's ultimate residence even though the user knows only the file's name.

Static vs. Dynamic Data

In the past, computing environments have been designed to meet "horizontal" needs, that is, general computing needs for an entire organization. As such, all applications were processed by common processors, and data was stored and backed up without regard to its status.

As storage requirements increase, it is important to look at how those procedures might be improved. For instance, some data is vital to the organization and must be kept for a long period. Other data is important, but short-lived. Some data is generated as a "by-product" of another process and is not useful after its immediate creation. Some is accessed frequently, some rarely, some never.

Backing up all the data in the enterprise every day, without regard to its status, may reflect significant waste and potential flaws in the storage management procedure. For example, data which does not change need not be backed up over and over. This is not to suggest that important data should not be safeguarded, but that significant savings might be achieved by removing it from the daily backup process. Determining which data fits this description is a useful exercise in analyzing the data "profile" of an organization. Removing data from the backup process while leaving it accessible to users is a key feature of automated archiving systems.
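
As an illustration of how such a data "profile" might be gathered, the sketch below flags files that have not changed within an assumed 90-day threshold as candidates for archiving rather than daily backup. It is a generic illustration under those assumptions, not part of any particular product:

    /* Illustrative sketch only: one way to identify "static" files that
     * need not be part of the daily backup -- files whose contents have
     * not changed for a chosen number of days. The threshold and policy
     * are assumptions for this example. */
    #include <stdio.h>
    #include <time.h>
    #include <sys/stat.h>

    #define STATIC_AFTER_DAYS 90   /* assumed site policy */

    /* Returns 1 if the file has not been modified within the threshold. */
    int is_static(const char *path)
    {
        struct stat sb;
        if (stat(path, &sb) != 0)
            return 0;                              /* unreadable: treat as active */
        double idle_days = difftime(time(NULL), sb.st_mtime) / 86400.0;
        return idle_days > STATIC_AFTER_DAYS;
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            printf("%s: %s\n", argv[i],
                   is_static(argv[i]) ? "archive once, drop from daily backup"
                                      : "include in daily backup");
        return 0;
    }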

Shortcomings of Today's Backup Solutions (Archiving and Disaster Control)

Backup cannot be used to manage all the data in an enterprise. This is because a "backup copy" of the data is used to restore from individual workstation malfunctions, inadvertent erasures, etc., and must remain on site - rendering it useless in a disaster. For instance, backing up two terabytes could take weeks - and restoring that amount would take at least as long. Few businesses could survive an interruption of that length. Using a backup copy for a "full restore" of a large site is problematic, and in practice often fails.

A well-designed storage management system provides the mechanisms to implement a disaster control system. The term disaster control is used to include both disaster planning and disaster recovery, which is quite different from backup and recovery. Conventional backups are not suitable for a large-scale disaster recovery system.

The high cost of magnetic disk is often used as justification for a more extensive storage management system. There is another, more compelling reason: when enormous amounts of data are stored on magnetic disk, a disk failure is catastrophic given the value of that data. The selected storage management system should reduce the need for magnetic disk - greatly minimizing that risk.

Data Storage Cost - Cost of Accessing Archived Data

Typical systems for storing archived data use "off-line" storage, usually magnetic tape. Tape is chosen because it is inexpensive and removable. Data that is archived to tape (i.e. removed from on-line storage), is accessed either manually or robotically (via jukebox). Libraries of tapes which exceed the jukebox capacity are kept off-line and mounted as needed. Depending on the number of tapes and the size of the organization, "mounting" the tape can take seconds or days.

Organizations are weighing the desirability of keeping data on- or near-line. As these on-line archives grow, and as data addition rates increase, the economics of off-line storage look less and less attractive. Cost components consist of labor, media, and maintenance of both the file system and the physical storage device. Although up-front costs are sometimes high, on- or near-line storage can drastically reduce the cost of supporting data archives.

The problem is that computer systems have not been designed to manage these tremendous volumes of data. During the first three computing "waves," keeping large amounts of data on- or near-line was not economically feasible. Mainframe data managers were designed to keep tight control of large amounts of mostly off-line data. The new, higher magnitudes of generated data are creating the need for new paradigms. In some instances, backing up a system can be done with well-established products and procedures. However, if the volume of data makes it physically impossible to do a backup within a certain timeframe, some radical changes are required.

Naturally, this radical change will require time to become "standard operating procedure," and various products vie to become the solution. The time has come for many organizations to make a choice - there is too much at stake (such as the survival of the enterprise) to delay any longer. The problem of providing safe storage for vast amounts of data while allowing fast access to it is the main focus of a specialized storage server and its associated storage management system.

Data Integrity

Because the "conventional" data management process has been to keep the majority of data off-line, data integrity has been dealt with simply. Data retains its integrity as long as the backup media can retain the information. Storing data as paper printouts provides very good retention and integrity.

Conventional off-line methods are adequate if the need to "re-process" data is small. As this need grows, more care must be taken to select storage media.

The data storage explosion raises a number of important data integrity issues. First, there is much more data now than there was. Second, "cross-platform compatibility" allows much of the data to be "re-used." And third, the data is not sitting idle in a vault. It is being accessed and shuffled in and out of drives.

The first issue requires an archival system to be high-capacity. The second requires easy access and robust error recovery, disaster recovery, and version control. The third requires a media with a long life and durable construction. And since the value of the data dwarfs the value of the hardware, it is vital that the data be accurately recorded in a manner consistent with national and industry standards to be accessible later.

Network File Transfer Performance

The main focus of data storage is safe storage and fast access. Since the data user will almost always be physically removed from the storage location, fast access will require high-speed network connectivity. A storage server must focus on delivering the information in a timely manner. This can be expressed in many ways, and different organizations will have different requirements.

It should be stated that fast access to data cannot easily be expressed as a single number, such as MIPS or FLOPS. This is because a storage system includes processing user requests as well as retrieving data from on-line, near-line, and off-line sources. The fastest processor in the world will not be efficiently utilized if the data is on a slow storage device, or the media on which it resides is not mounted.

Fast access to data must be balanced with cost, longevity and security needs of the organization. This again suggests separating data which needs "instant" access from data which is accessed rarely.

Problems

Proliferation of Workstation & Server Hard Disks

The cost of storing data is worth considering. Take the average Fortune 1000 company and its terabyte of data - where is it? How much of it is on magnetic disk? How much has been accessed in the last 30 days, the last 60, the last year? While the cost of purchasing magnetic disk is dropping rapidly, the cost to manage the disk is very high. According to Michael Peterson of Strategic Research Corporation, the annual cost of managing a hard-disk drive is $7 per megabyte, including file management, backup, archiving, installation and repair (PC Week, August 22, 1994, v.11 n. 33 p. N3). Based on that estimate, the annual cost of managing a nine-gigabyte drive is over $60,000. How much is it costing the Fortune 1000 company to manage its terabyte of data? A major inefficiency in distributed data storage is that static data is resident on magnetic disks throughout the organization. This data is at risk. A major contributor to the need for increased magnetic disk storage is the growth in data volumes - including those which are static.
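
Taking the cited $7-per-megabyte figure at face value, the arithmetic behind these numbers is simple; the only assumption below is treating 9 GB as 9,000 MB, matching the text's 1 TB = 1,000,000 MB convention:

    /* Applies the cited $7/MB-per-year management figure to the drive
     * and data volumes mentioned in the text. Purely arithmetic. */
    #include <stdio.h>

    int main(void)
    {
        const double cost_per_mb_year = 7.0;        /* cited Strategic Research figure */
        const double drive_mb         = 9000.0;     /* nine-gigabyte drive */
        const double terabyte_mb      = 1000000.0;  /* one terabyte, per the text */

        printf("9 GB drive: $%.0f per year\n", drive_mb * cost_per_mb_year);    /* $63,000 */
        printf("1 TB total: $%.0f per year\n", terabyte_mb * cost_per_mb_year); /* $7,000,000 */
        return 0;
    }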

Proliferation of Backup and Backup Management Systems

The so-called "third wave" of computing produced a variety of computer systems, from supercomputers to personal digital assistants, all communicating over networks of various sorts. This was accompanied by an increase in data storage and a resultant increase in backup systems. In most cases, each system required its own backup system. The result is a large installed base of "legacy" backup systems, which are costly to maintain in terms of technology and personnel. They are operator-intensive, meaning that most are not automatic - a conscious effort to back up must be made. These backup systems scale poorly, and as more data is added, they are stretched to their limits. As archives build up, more and more data is potentially lost (in the sense of "not knowing where it is"), and storage area needs grow. As time goes on, these problems grow worse.

Storage Media Volatility

The selection of appropriate storage media is fundamental to the design of the data storage system. Each media has cost, longevity, and access trade-offs associated with it. These trade-offs are manageable - for instance, a media with a short life can be used for business-critical data if it is backed up daily or if multiple copies are kept. Inactive but important data relegated to removable storage must be on a long-life media to assure its integrity.

The subject of media choices could fill multiple volumes - it is sufficient to say that the variety of data and the varying requirements in terms of access, life span and cost necessitate close scrutiny of the choice of media and the method of recording data.

Data Formats

Now that systems can communicate within a network environment, attention can be focused on what is perhaps the more important issue - how is the data stored? It's a safe bet that whatever media data is stored on today will be supplanted by a future media at lower cost and higher capacity. Committing to a truly open system should start with a commitment to storing the data in an open manner.

Limitations in the UNIX Native File System

To handle very large data storage needs, multiple storage device types are needed. This is implicit in the need for capacity, performance, integrity, and disaster control.

File System Limitations
The UNIX operating system cannot spread a file system across multiple media. This means that for each disk or other storage device there is a complete UNIX file system ("mount point"), and administration tasks must be performed on each one individually. This limitation is especially critical when you consider the cataloging needs of an archive system. The maximum file system size is also limited by the fixed allocation of inodes. If the capacity of the media exceeds the maximum file system size, that media cannot be partitioned and managed effectively.
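
The fixed inode budget is easy to observe on any mounted file system using the standard statvfs() call; once the free inode count reaches zero, no new files can be created even if free blocks remain:

    /* Shows the fixed inode budget of a mounted file system. On a
     * traditional UNIX file system, f_files is set when the file system
     * is created and cannot grow afterward. */
    #include <stdio.h>
    #include <sys/statvfs.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/";
        struct statvfs vs;

        if (statvfs(path, &vs) != 0) {
            perror("statvfs");
            return 1;
        }
        printf("%s: %lu inodes total, %lu free; %lu blocks total, %lu free\n",
               path,
               (unsigned long)vs.f_files,  (unsigned long)vs.f_ffree,
               (unsigned long)vs.f_blocks, (unsigned long)vs.f_bfree);
        return 0;
    }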

Disk Storage Limitations
UNIX file allocation schemes based on a single disk allocation unit (DAU) size are never ideal. Although the DAU size is selectable by the administrator when the disk is initialized, a compromise must be made between fast access and efficient utilization of space. And UNIX provides no way to ensure contiguous writing of files, which is critical for fast access times.
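
A small calculation makes the compromise concrete. The file and DAU sizes below are arbitrary assumptions chosen only to show the trade-off between wasted space and the number of allocation units that must be tracked:

    /* Illustrates the single allocation-unit compromise described above.
     * File sizes and DAU sizes are assumptions for this example only. */
    #include <stdio.h>
    #include <math.h>

    static void show(double file_kb, double dau_kb)
    {
        double units = ceil(file_kb / dau_kb);      /* allocation units consumed */
        double used  = units * dau_kb;
        printf("%8.0f KB file, %4.0f KB DAU: %8.0f units, %5.1f%% of allocated space unused\n",
               file_kb, dau_kb, units, 100.0 * (used - file_kb) / used);
    }

    int main(void)
    {
        show(2.0,      64.0);   /* small file, large DAU: ~97% of the unit is wasted   */
        show(2.0,       4.0);   /* small file, small DAU: far less waste               */
        show(100000.0, 64.0);   /* large file, large DAU: few units, fast sequential I/O */
        show(100000.0,  4.0);   /* large file, small DAU: 25,000 units to track         */
        return 0;
    }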

Error Recovery
A proper shutdown of the UNIX system ensures that all system inodes, buffers and superblocks have been flushed to the disk. An unexpected shutdown can leave the file system in a state of disarray that can cause a long delay in re-starting the system. When the system is re-started, there is no guarantee that all files will be restorable.

As alluded to above, "unexpected events" can cripple a UNIX system. If a system were to grow to a very large number of files, a sudden power interruption could be catastrophic.

Disasters
The need for disaster planning is not contested. As data capacity grows, the need for disaster planning grows with it. UNIX is fundamentally unable to manage removable media, which means it cannot by itself provide the mechanism for disaster recovery.

Solutions

Virtually all companies are faced with the issues briefly mentioned above. There are a number of possible courses of action. The first is to do nothing, and continue with present methods. This short-term "solution" can have serious impact on a firm's long-term existence. Another is to select a single backup system and try to consolidate all the backup platforms into a single, monolithic system. This is difficult, given cross-platform and inter-departmental issues. More problematic is that the backup systems are already stretched to their limits, so rolling them up into a single, larger system does not make sense. A common-sense approach is to adopt a platform which relieves the problems associated with backup, provides that service "enterprise-wide," and does it with a single hardware/software solution.

Backup is Only Part of the Solution

Conventional backup is necessary to recover files erased inadvertently. The nature of backups in a large, automated and heterogeneous environment is that data is copied multiple times throughout the system. This is costly in terms of media use. Because overall storage requirements are growing so large, the load on these backup systems must be lightened. In addition to "backup window" problems, overloaded backups also present a real threat to the integrity of the backups, and can increase the chances of multiple versions of the same data existing on-line or of overwriting on-line data with older versions. To lighten the load on backup, an automated archiving system removes safely archived data from the backup routine. This also forms the basis of the disaster recovery system.

Basic Requirements

Until now, there has been no single solution to the problems of storing, archiving and backing up this growing mass of data. The requirements of the solution have, most of all, to do with the "change dynamics" of the system. In other words, how long can the storage methods be used without having to convert data or migrate to a new platform? Upward mobility in terms of storage devices and microprocessors is an absolute must. Data written to ANSI recording standards is portable to any future system. In the event of a disaster - the ultimate "change" - the data must move in a timely and transparent manner to the new system.

The IDS/SAM system provides this automated archiving environment.

The IDS/SAM Solution

The Integrated Data Server(TM) (IDS) and Storage and Archive Manager File System(TM) (SAM-FS)

The solution available today is IDS/SAM. Because use of computer networks is now accelerating, and the rate at which we accumulate data is accelerating even faster, this network data server presents the most scalable, open solution while safeguarding the data of the organization. Since the accumulated data will likely be kept on- or near-line, scalability is a key feature.

There are many aspects to this issue which hinge not only on future needs, but on those of today. A prominent industry analyst recently wrote, "People managing local-area-networks today are re-learning the data center of 25-30 years ago." Corporations downsizing from mainframes to LANs are finding that the tools they relied on in the past are not available on the LAN. These tools provided security, availability, data integrity and, ultimately, cost control. Business-critical applications and large volumes of data are forcing LAN managers to look for "mainframe-class" storage management capability. These capabilities are provided by IDS/SAM.

Storage Management Hardware Platform Requirements

Every hardware platform has its own limitations. Most are designed for general purpose applications. With a dedicated storage management server there are several main requirements which must be addressed in the hardware platform. Among these are: scalability, I/O bandwidth and reliability. Since computing and communications will ordinarily be handled by other machines, it is most efficient and necessary to use a storage management server exclusively for its intended purpose.

Why use UNIX?

UNIX is acknowledged to be the operating system of choice for mainframe downsizing and the evolution of PC LANs into enterprise-wide networks. It is the industry-standard open systems environment. The lack of specific storage management functionality was intentional: the developers of UNIX believed in a compact kernel, foresaw the need for additional features and built in appropriate "hooks" for developers. These hooks allow specialized systems to integrate with UNIX without requiring core (kernel) modifications. This, in essence, is what makes UNIX the best open systems solution.

Virtual File System (vfs) Interface

UNIX offers a standard interface that allows third-party systems to "hook" into the system and improve on the basic limitations described above. This mechanism is the vfs interface, which allows UNIX to accommodate additional file systems such as NFS.

The logical hierarchical structure of the UNIX file system allows vendors to provide new capabilities without straying from the vfs standard.

By utilizing the vfs interface, any file system can be made available to the system without depending on the version(s) of UNIX running elsewhere on the network.
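
The idea can be sketched as a table of function pointers through which generic kernel code calls whichever file system owns a mount point. The structure below is illustrative only; the names and signatures are assumptions for this sketch and do not reproduce the actual Solaris vfs headers:

    /* Simplified illustration of the vfs idea: the kernel calls a file
     * system only through a table of function pointers, so a new file
     * system (NFS, SAM-FS, ...) can be added without kernel changes.
     * Illustrative names and signatures only. */
    #include <stddef.h>

    struct vnode;                       /* opaque per-file handle          */
    struct vfs_ops {                    /* per-file-system operation table */
        int (*mount)(const char *device, const char *mountpoint);
        int (*unmount)(const char *mountpoint);
        int (*lookup)(struct vnode *dir, const char *name, struct vnode **out);
        int (*read)(struct vnode *file, void *buf, size_t len, long offset);
        int (*write)(struct vnode *file, const void *buf, size_t len, long offset);
    };

    /* A storage-management file system registers its own table; generic
     * code dispatches through whichever table the mount point uses. */
    int generic_read(const struct vfs_ops *ops, struct vnode *file,
                     void *buf, size_t len, long offset)
    {
        return ops->read(file, buf, len, offset);
    }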

The Integrated Data Server (IDS)

The Integrated Data Server (IDS) is a "plug-and-play" network storage server that provides cost-effective file management, storage, archival and retrieval services. The IDS includes both hardware and software. The hardware consists of a high-performance Sun SPARCserver with greatly expandable peripheral connectivity that can be configured with magnetic cache, disk arrays, optical and tape autochangers. The IDS is delivered in an attractive, well-ventilated cabinet requiring about seven square feet (.6 square meter) of floor space. The Storage and Archive Manager File System (SAM-FS) software provides robust, high-performance hierarchical storage management (HSM) and archiving services.

Scalability

The IDS is designed to address your storage needs today, and in the future. On-line data storage capacity is scalable from several gigabytes to hundreds of terabytes.

Support for Heterogeneous Environments

The IDS is a high-performance NFS storage server that can manage storage of heterogeneous systems on the network, including Sun, Hewlett-Packard, IBM, Digital, SGI, Cray and others.

Data Throughput

The SPARC-based IDS supports Ethernet, FastEthernet, FDDI, and ATM for improved network performance. Fibre Channel and HiPPI will be supported as they become commercially available.

Reliability

The IDS takes advantage of reliability features provided by Sun and Solaris. For higher reliability, an IDS can be configured with disk arrays and redundant power supplies. When reliability is critical, the IDS is available as dual-independent systems with redundant control and data paths to all the data storage devices on the dual systems.

The Storage and Archive Manager File System (SAM-FS)

SAM-FS provides a complete storage management and archiving environment to any network using the NFS protocol.

Extended File Attributes and No O/S Modifications

SAM-FS is a storage and archiving management system with extensive hierarchical storage management (HSM) capabilities. It operates as an additional file system under Solaris 2, requiring no kernel modifications or modifications to the native file system.

The unique features of SAM-FS are designed to meet the challenging requirements of managing very large quantities of data over long periods of time in a heterogeneous networking and storage environment.

An Unlimited File System

The capacity of the SAM file system is virtually unlimited not only in its total size, but also in the number of files that may be contained within the file system. The capacity of a SAM file system can expand beyond the traditional limits by allowing a file system to span multiple partitions/disks, and by allowing the file inode information to dynamically grow. This eliminates the traditional UNIX problem of having to estimate disk space for data and inodes at system initialization.

File System Size

Traditional UNIX implementations do not allow file systems to span more than one physical device. SAM-FS enables files to span media by handling multiple on-line storage devices as a "Storage Family Set", and has no limitation on file system size.

Optimized Handling of Small and Large Files

UNIX uses a fixed allocation unit to record files to magnetic disk -- meaning that the file system is optimized for either large files or small files, but not both. SAM-FS has a patented Dual Storage Allocation Unit (DAU) scheme which allows optimal handling of both small and large files. This allows for high performance and efficient utilization of magnetic disk.

File Archiving

A SAM-FS file can have up to four archive images. Once an archive image of a file is created, the data in that file can be considered backed up. All archive images are written in tar format, preserving not only the data, but also ownership, access control and path name at the time the file is archived.
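
The standard (ustar) tar header layout shows why this metadata survives: each archived file is preceded by a 512-byte header block recording its path, ownership and permissions as plain text. The structure below is the standard header format, shown for reference rather than as SAM-FS source:

    /* Layout of a POSIX (ustar) tar header block. Numeric fields are
     * stored as octal ASCII text, names as plain text, which is what
     * makes tar-format archive images readable on any system with a
     * compatible tar utility. */
    struct ustar_header {            /* one 512-byte block per archived file */
        char name[100];              /* path name at archive time            */
        char mode[8];                /* access permissions                   */
        char uid[8];                 /* owner user id                        */
        char gid[8];                 /* owner group id                       */
        char size[12];               /* file size in bytes                   */
        char mtime[12];              /* last-modification time               */
        char chksum[8];              /* header checksum                      */
        char typeflag;               /* regular file, link, directory, ...   */
        char linkname[100];          /* target of a link, if any             */
        char magic[6];               /* "ustar"                              */
        char version[2];
        char uname[32];              /* owner user name                      */
        char gname[32];              /* owner group name                     */
        char devmajor[8];
        char devminor[8];
        char prefix[155];            /* extension of the path name           */
        char pad[12];                /* padding to 512 bytes                 */
    };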

Associative Staging

Associative staging is an attribute of SAM-FS that can be assigned to a file or directory. When the attribute is enabled, staging a file with this attribute causes every file in the requested file's directory (that also has the attribute enabled) to be staged as well. Associative staging not only reduces manual intervention by the user and speeds access to related files, it also significantly reduces robot motion and media shuffling.
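
A conceptual sketch of that control flow follows. It is not LSC source code; stage_file() and has_associative_attr() are hypothetical stand-ins for the real SAM-FS mechanisms. The point is that one stage request pulls in every similarly-flagged file in the same directory, so related files come off the media in one pass:

    /* Conceptual sketch of associative staging. Hypothetical helpers
     * stand in for the real SAM-FS attribute and staging mechanisms. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-ins for the real mechanisms. */
    static int has_associative_attr(const char *path) { (void)path; return 1; }
    static int stage_file(const char *path) { printf("staging %s\n", path); return 0; }

    int stage_associatively(const char *dirpath, const char *requested)
    {
        char path[1024];
        struct dirent *de;
        DIR *dir = opendir(dirpath);
        if (dir == NULL)
            return -1;

        while ((de = readdir(dir)) != NULL) {
            if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                continue;
            snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
            /* Stage the requested file plus any sibling carrying the attribute. */
            if (strcmp(de->d_name, requested) == 0 || has_associative_attr(path))
                stage_file(path);
        }
        closedir(dir);
        return 0;
    }

    int main(void)
    {
        /* Stage "report.dat" (a hypothetical file name) and its flagged siblings. */
        return stage_associatively(".", "report.dat");
    }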

HSM / Removable Media Management

As a network grows to hundreds of gigabytes or terabytes of data, it is not cost-effective or practical to store all that data on magnetic disk. UNIX is not able to handle unmounted removable media -- the platter or tape that is mounted is the only one UNIX is aware of. The SAM File System's archiving and HSM features are designed to support removable media, including media that has been removed from a jukebox. An archiving environment with hundreds or thousands of removable cartridges is useless unless those cartridges can be identified by the system. SAM-FS identifies and manages data by the volume serial name (VSN) of the cartridge on which it resides. In a jukebox, the VSNs are automatically mounted. If the VSN is off-line, SAM-FS notifies the operator that the media should be mounted. The requested media can be mounted in any device -- the scanner automatically locates it and makes it available to the user.
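
Conceptually, a catalog ties each VSN to its current location and drives either an automatic jukebox mount or an operator request. The sketch below is a hypothetical illustration of that logic, not LSC source code:

    /* Conceptual sketch of VSN-based media management. The catalog,
     * entries and messages are assumptions for this illustration. */
    #include <stdio.h>
    #include <string.h>

    enum location { IN_JUKEBOX, OFFLINE_SHELF };

    struct catalog_entry {
        char vsn[32];              /* volume serial name on the cartridge label */
        enum location where;
    };

    /* Tiny in-memory catalog used only for this illustration. */
    static struct catalog_entry catalog[] = {
        { "ARC001", IN_JUKEBOX    },
        { "ARC002", OFFLINE_SHELF },
    };

    static void request_mount(const char *vsn)
    {
        for (size_t i = 0; i < sizeof(catalog) / sizeof(catalog[0]); i++) {
            if (strcmp(catalog[i].vsn, vsn) != 0)
                continue;
            if (catalog[i].where == IN_JUKEBOX)
                printf("%s: mounting automatically in the jukebox\n", vsn);
            else
                printf("%s: operator message -- please mount this cartridge "
                       "in any available drive\n", vsn);
            return;
        }
        printf("%s: unknown VSN\n", vsn);
    }

    int main(void)
    {
        request_mount("ARC001");
        request_mount("ARC002");
        return 0;
    }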

Transparent Access

The operation of SAM-FS is completely transparent to users. All of a user's files, including those which have been archived, appear in the user's directory. When a user requests a file, it is loaded either from the local device or, if it has been archived, from the media on which it resides.

The Operator Command Interface

All operational aspects of SAM-FS can be controlled from three graphical user interfaces (GUIs). Each GUI is designed to dynamically display SAM-FS-related information and to allow either operations or administration staff to interrogate and control SAM storage devices and associated media.

Disaster Recovery Features

Statistics show that 50% of businesses that lose their data due to disaster go out of business. Of those, 90% go out of business within 24 months. (UNIX World, August 1993)

Because archiving systems deal with large amounts of data, SAM-FS is designed to recover from a disaster quickly and efficiently. SAM-FS can help prevent disaster situations and minimize the impact when a disaster does occur.

Recovery from Disk Failure
Should a disk fail, SAM-FS is capable of recovering all data belonging to archived files. If an archive copy exists, the disk space assigned to the file is released, leaving the data restorable from the archive image.

Multiple Copies - Support of Multiple Media Types
SAM-FS software's ability to write multiple copies simultaneously allows valuable information to be stored both on-site and off-site, providing additional protection against disasters. It can use disk array technology to ensure that no data will be lost during power interruptions or system fluctuations. SAM-FS supports tape, WORM and rewritable optical devices, enabling data to be stored on media that is appropriate for the required life of that data.

Lights Out Operation
SAM-FS provides features that adhere to the procedures defined by the disaster recovery plan, such as automatically routing data to the specified media, making multiple removable copies of the data (if required), and recording all file information necessary for recovering the data from the media.

Control Structure Dumps
Since the archiving system is managing a large number of files, directory and file location information is vital to the continued operation of the system. Regular "dumps" of such data must be made to removable media to allow a "hot" or "warm" site to restore the file system to its last known state in the event of a disaster.

Data Format - Media Labeling
A workable disaster recovery plan requires that archived information be self-recoverable, even when the system that created the information is unavailable. SAM software records the complete file information in industry standard tar (Tape ARchive) format on all removable media. This enables files to be recovered by any system with a compatible media device, with or without SAM software installed.

SAM provides the flexibility and control needed to implement a disaster recovery plan which can be customized to the particular site's requirements.

Superior Design

The IDS/SAM-FS was designed to maintain compatibility with standard UNIX systems, while at the same time offering additional features. A look at recent history should explain why this is so important:

During the first three or four decades of computing, proprietary systems reigned. This was primarily due to the lack of common standards and the resulting lack of broadly applicable software systems. Since the personal computer revolutionized computing during the 1980s, customers have demanded that systems from multiple vendors cooperate on enterprise-wide networks. This has required a continuing evolution in the computing industry. Until these changes occurred, open systems were not possible to implement.

Of course, the promise of open systems makes that effort worthwhile. Once the decision has been made to produce an "open" product, each development decision should be made with this overall goal in mind.

"Shortcuts" are often taken by manufacturers to reduce development time, to conform to a proprietary system, or simply from lack of commitment. These shortcuts usually mean that the resulting system is not completely open. A truly open implementation will work with any "flavor" of UNIX, and also any system based on open systems standards.

In many applications the life span of the data will exceed the life span of the equipment it is stored upon - including the IDS. Therefore, the primary design mission is to ensure that the data can be read by any UNIX system (or any system capable of reading a standard recording format). Data must be portable and must not depend upon a specific vendor for reading.

Truly open systems mean that the applications, networks, and hardware are virtually transparent to the user - the way it should be. The mechanics of computing should be transparent, to allow users to focus on the data - the most important asset in any enterprise. Thus the promise of open systems is that it provides data portability. Storing the data in a proprietary format or in a file system tied to a particular architecture or vendor is risky and inconsistent with the spirit of open systems.

Conclusion

The most secure and efficient storage system is one which matches data and media according to the data's importance to the enterprise. By utilizing each type of media to its fullest, the most cost-effective use of that media is realized.

As technologies come and go, the promise of open systems lies in the portability and security of the data, regardless of platform or operating environment.

The IDS and SAM-FS are designed to support open systems, incorporate standard technology to protect customer investment, and provide broad connectivity and media interchange with other environments. Some IDS/SAM features are listed below:

  • Sun Solaris 2 host operating system; Sun SPARC processor technology.

  • All media is ANSI-labeled. Thus, all SAM-FS written media can be read on other systems including non-UNIX systems, and all externally-written media can be read by SAM-FS.

  • All archive images are written in tar format, preserving not only data but also the original path name, ownership and access control.

  • Complete transparency for all file operations (i.e. open, read, write, close, etc.) whether on-line magnetic disk files, off-line archive files, or removable media.

  • Support of standard network protocols and applications (i.e. FTP, NFS, rcp, etc.) and the use of off-the-shelf standard network interfaces (i.e. Ethernet, FastEthernet, FDDI, ATM).

  • Standard off-the-shelf peripheral technology based on SCSI-2 interfaces.

    (c)1994 LSC, INC ALL RIGHTS RESERVED
    Integrated Data Server (IDS), Storage and Archiving Manager (SAM-FS) and Fast File Recovery System are trademarks of LSC, Inc. All other trademarks are the property of their respective owners.



    For more information, send us mail at inform@lsci.com.